Abstract

In this work we propose a technique that transfers
supervision between images from different modalities. We use
learned representations from a large labeled modality as
a supervisory signal for training representations for a new
unlabeled paired modality. Our method enables learning
of rich representations for unlabeled modalities and can
be used as a pre-training procedure for new modalities
with limited labeled data. We show experimental results
where we transfer supervision from labeled RGB images
to unlabeled depth and optical flow images and demonstrate
large improvements for both these cross-modal supervision
transfers. Code, data and pretrained models are available at
https://github.com/s-gupta/fast-rcnn/tree/distillation.


1. Introduction 

Current paradigms for recognition in computer vision
involve learning a generic feature representation on a large
dataset of labeled images, and then specializing or
finetuning the learned generic feature representation for the
specific task at hand. Successful examples of this paradigm
include almost all state-of-the-art systems: object detection
[13], semantic segmentation [36], object segmentation [19],
and pose estimation [49], which start from generic features
that are learned on the ImageNet dataset [6] using over a
million labeled images and specialize them for each of the
different tasks. Several different architectures for learning
these generic feature representations have been proposed
over the years [31, 44, 3], but all of these rely on the
availability of a large dataset of labeled images to learn
feature representations.

The question we ask in this work is, what is the analogue
of this paradigm for images from modalities which do not
have such large amounts of labeled data? Beyond the RGB
images that dominate computer vision, there are a large
number of other image modalities, for example depth images
coming from a Microsoft Kinect, infra-red images from
thermal sensors, aerial images from satellites and drones,
LIDAR point clouds from laser scanners, or even images
of intermediate representations output from current vision
systems, e.g. optical flow and stereo images. The number
of labeled images from such modalities is at least a few
orders of magnitude smaller than the RGB image datasets
used for learning features, which raises the question: do we
need similar large scale annotation efforts to learn generic
features for images from each such different modality?

Figure 1: Architecture for supervision transfer: We train a
CNN model for a new image modality (like depth images), by
teaching the network to reproduce the mid-level semantic
representations learned from a well labeled image modality
(such as RGB images) for modalities for which there are
paired images.

We answer this question in this paper and propose a
technique to transfer learned representations from one
modality to another. Our technique uses ‘paired’ images from
the two modalities and utilizes the mid-level representations
from the labeled modality to supervise learning
representations on the paired unlabeled modality. We call our
scheme supervision transfer and show that our learned
representations perform well on standard tasks like object
detection. We also show that our technique leads to learning
useful feature hierarchies in the unlabeled modality, which
can be improved further with finetuning, and are still
complementary to representations in the source modality.

As a motivating example, consider the case of depth
images. While the largest labeled RGB dataset, ImageNet [6],
consists of over a million labeled images, the size of most
existing labeled depth datasets is on the order of a few
thousand [42, 46, 26]. At the same time there are a large
number of unlabeled RGB and depth image pairs. Our technique
leverages this large set of unlabeled paired images to
transfer the ImageNet supervision on RGB images to depth
images. Our technique is illustrated in Figure 1. We use a
convolutional neural network that has been trained on labeled
images in the ImageNet dataset [6], and use the mid-level
representation learned by this CNN as a supervisory
signal to train a CNN on depth images. Our technique for
transferring supervision results in improvements in
performance for the end task of object detection on the NYUD2
dataset, where we improve the state-of-the-art from 34.2% to
41.7% when using just the depth image and from 46.2% to 49.1%
when using both RGB and depth images together. We report
similar improvements for the task of simultaneous detection
and segmentation [19] and also show how supervision
transfer can be used for a zero-shot transfer of object
detectors trained on RGB images to detectors that can run on
depth images.

Though we show detailed experimental results for
supervision transfer from RGB to depth images, our technique
is equally applicable to images from other paired
modalities. To demonstrate this, we show additional transfer
results from RGB images to optical flow images where we
improve mean average precision for action detection on the
JHMDB dataset [27] from 31.7% to 35.7% when using just
the optical flow image and no supervised pre-training.

Our technique is reminiscent of the distillation idea from
Hinton et al. [22] (and its recent FitNets extension [39]).
Hinton et al. [22] extended the model compression idea
from Bucilua et al. [2] to what they call ‘distillation’ and
showed how large models trained on large labeled datasets
can be compressed by using the soft outputs from the large
model as targets for a much smaller model operating on
the same modality. Our work here is a generalization of
this idea, and a) allows for transfer of supervision at
arbitrary semantic levels, and b) additionally enables
transfer of supervision between different modalities using
paired images. More importantly, our work here allows us to
extend the success of recent deep CNN architectures to new
imaging modalities without having to collect the large scale
labeled datasets necessary for training deep CNNs.

2. Related Work 

There has been a large body of work on transferring
knowledge between different visual domains belonging to
the same modality. Initial work [32, 15, 1, 8, 24] studied the
problem in the context of shallow image representations.
While [32, 15] sought to learn transformations between well
labeled source and sparsely labeled target domains, Aytar et
al. [1] use the source models as a parameter regularizer for
target models, and [8, 24] combine these two approaches into
a single joint optimization problem. Chopra et al. [4]
introduced one of the first deep architectures for visual
adaptation by replicating feature extraction for each domain
and producing intermediate interpolated domains, while
Ghifary et al. [11] showed a single layer neural net could be
used to learn the feature transformation between simple
domain shifts.

More recently, with the introduction of supervised CNN
models by Krizhevsky et al. [31], the community has been
moving towards a generic set of features which can be
specialized to specific tasks and domains at hand [7, 13, 12,
40, 23] and traditional visual adaptation techniques can be
used in conjunction with such features [25]. In addition,
unsupervised domain adaptation techniques have been
introduced which learn to adapt deep representations so as
to minimize the discrepancy between the source and target
distributions [50, 10, 37].

All these lines of work study and solve the problem
of domain adaptation within the same modality. In
contrast, our work here tackles the problem of domain
adaptation across different modalities. Most methods for
intra-modality domain adaptation described above start from
an initial set of features on the target domain, and a priori
it is unclear how this can be done when moving across
modalities, limiting the applicability of the aforementioned
approaches to our problem. This cross-modal transfer
problem has received much less attention. Notable works
include [5, 38, 48, 45, 9]. While [5, 48] hallucinate
modalities during training time, [38, 45, 9] focus on the
problem of jointly embedding or learning representations from
multiple modalities into a shared feature space to improve
learning [38] or to enable zero-shot learning [45, 9]. Our
work here instead transfers high quality representations
learned from a large set of labeled images of one modality to
completely unlabeled images from a new modality, thus leading
to generic feature representations on the new modalities
which we show are useful for a variety of tasks.

3. Supervision Transfer 

Let us assume we have a modality $\mathcal{M}_d$ with
unlabeled data, $D_d$, for which we would like to train a rich
representation. We will do so by transferring information
from a separate modality, $\mathcal{M}_s$, which has a large
labeled set of images, $D_s$, and a corresponding $K$-layered
rich representation. We assume this rich representation is
layered although our proposed method will work equally well
for non-layered representations. We use convolutional neural
networks as our layered rich representation.

We denote this image representation as
$\Phi = \{\phi^i_{\mathcal{M}_s, D_s} \ \forall i \in [1 \ldots K]\}$.
$\phi^i_{\mathcal{M}_s, D_s}$ is the $i^{th}$ layer
representation for modality $\mathcal{M}_s$ which has been
trained on labeled images from dataset $D_s$, and it maps an
input image from modality $\mathcal{M}_s$ to a feature vector
in $\mathbb{R}^{n_i}$:

$\phi^i_{\mathcal{M}_s, D_s} : \mathcal{M}_s \mapsto \mathbb{R}^{n_i}$    (1)

Feature vectors from consecutive layers in such layered
representations are related to one another by simple
operations like non-linearities, convolutions, pooling,
normalizations and dot products (for example layer 2 features
may be related to layer 1 features using a simple
non-linearity like max with 0:
$\phi^2(x) = \max(0, \phi^1(x))$). Some of these operations
like convolutions and dot products have free parameters. We
denote such parameters associated with the operation at layer
$i$ by $w^i$. The structure of such architectures (the
sequence of operations, the size of representations at each
layer, etc.) is hand designed or validated using performance
on an end task. Such validation can be done on a small set of
annotated images. Estimating the model parameters $w^i$ is
much more difficult. The number of these parameters for most
reasonable image models can easily go up to a few million.
So far, state-of-the-art models have required discriminative
learning of these parameters using a large labeled training
set.
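The layered structure described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: feature vectors are plain lists, the only operations are a dot-product layer (with free parameters) and a max-with-0 non-linearity (without), and the helper names `dot_layer`, `relu` and `layered_representation` are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def dot_layer(weights: Sequence[Sequence[float]]) -> Layer:
    """Layer with free parameters: one dot product per output unit."""
    def apply(x: Vector) -> Vector:
        return [sum(w * v for w, v in zip(row, x)) for row in weights]
    return apply


def relu(x: Vector) -> Vector:
    """Parameter-free layer: element-wise max with 0."""
    return [max(0.0, v) for v in x]


def layered_representation(layers: Sequence[Layer], image: Vector) -> List[Vector]:
    """Return the features of every layer, each computed from the previous one."""
    features = []
    x = image
    for layer in layers:
        x = layer(x)
        features.append(x)
    return features


# Toy two-layer example: layer 2 is max(0, layer 1).
toy_layers = [dot_layer([[1.0, -1.0], [2.0, 0.5]]), relu]
phi = layered_representation(toy_layers, [1.0, 2.0])
```

Here `phi[0]` plays the role of the layer 1 features and `phi[1]` the layer 2 features; only `dot_layer` carries parameters that would have to be learned.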

Now suppose we want to learn a rich representation for
images from modality $\mathcal{M}_d$, for which we do not have
access to a large dataset of labeled images. We assume we
have already hand designed an appropriate architecture
$\Psi = \{\psi^i_{\mathcal{M}_d} \ \forall i \in [1 \ldots L]\}$.
The task then is to effectively learn the parameters
associated with the various operations in the architecture,
without having access to a large set of annotated images for
modality $\mathcal{M}_d$. As before, we denote these
parameters to be learned by
$W_d^{[1 \ldots L]} = \{w_d^i \ \forall i \in [1 \ldots L]\}$.

In addition to $D_s$, let us assume that we have access to
a large dataset of un-annotated paired images from
modalities $\mathcal{M}_s$ and $\mathcal{M}_d$. We denote this
dataset by $U_{s,d}$. By paired images we mean a set of images
of the same scene in two different modalities. Our proposed
scheme for training rich representations for images of
modality $\mathcal{M}_d$ is to learn the representation $\Psi$
such that the image representation
$\psi^L_{\mathcal{M}_d}(I_d)$ for image $I_d$ matches the image
representation $\phi^{i^*}_{\mathcal{M}_s, D_s}(I_s)$ for its
image pair $I_s$ in modality $\mathcal{M}_s$ for some chosen
and fixed layer $i^* \in [1 \ldots K]$. We measure the
similarity between the representations using an appropriate
loss function $f$ (for example, euclidean loss). Note that the
representations $\phi^{i^*}$ and $\psi^L$ may not have the
same dimensions. In such cases we embed features $\psi^L$ into
a space with the same dimension as $\phi^{i^*}$ using an
appropriate simple transformation function $t$ (for example a
linear or affine function).

$\min_{W_d^{[1 \ldots L]}} \sum_{(I_s, I_d) \in U_{s,d}} f\left(t\left(\psi^L_{\mathcal{M}_d}(I_d)\right), \phi^{i^*}_{\mathcal{M}_s, D_s}(I_s)\right)$    (2)

We call this process supervision transfer from layer $i^*$ in
$\Phi$ of modality $\mathcal{M}_s$ to layer $L$ in $\Psi$ of
modality $\mathcal{M}_d$.
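As a rough illustration of the supervision transfer objective, the sketch below evaluates the summed loss over a set of paired images. It is a toy under stated assumptions: images and features are plain lists of floats, and `phi_source`, `psi_target` and `identity` are hypothetical placeholders standing in for the frozen source network, the target network being trained, and the transformation function; no optimization is performed.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Image = List[float]  # toy stand-in: an "image" is just a list of numbers


def squared_l2(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance, one possible choice for the loss."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def transfer_objective(
    pairs: Sequence[Tuple[Image, Image]],
    phi_source: Callable[[Image], Vector],
    psi_target: Callable[[Image], Vector],
    t: Callable[[Vector], Vector],
    f: Callable[[Vector, Vector], float] = squared_l2,
) -> float:
    """Sum of f(t(psi(I_d)), phi(I_s)) over paired images (I_s, I_d)."""
    return sum(f(t(psi_target(i_d)), phi_source(i_s)) for i_s, i_d in pairs)


# Toy example; every function below is a hypothetical placeholder.
phi_source = lambda image: [2.0 * v for v in image]  # frozen source features
psi_target = lambda image: list(image)               # target features being learned
identity = lambda features: features                 # transformation function

pairs = [([1.0, 2.0], [2.0, 4.0]), ([0.0, 1.0], [0.0, 3.0])]
loss = transfer_objective(pairs, phi_source, psi_target, identity)
```

In an actual training setup the parameters of the target network (and of the transformation) would be adjusted to reduce this quantity, while the source network stays fixed.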

The recent distillation method from Hinton et al. [22] is
a specific instantiation of this general method, where a) they
focus on the specific case when the two modalities
$\mathcal{M}_s$ and $\mathcal{M}_d$ are the same and b) the
supervision transfer happens at the very last prediction
layer, instead of an arbitrary internal layer in
representation $\Phi$.

Our experiments in Section 4 demonstrate that this
proposed method for transfer of supervision is a) effective at
learning good feature hierarchies, b) these hierarchies can
be improved further with finetuning, and c) the resulting
representation can be complementary to the representation
in the source modality $\mathcal{M}_s$ if the modalities
permit.

4. Experiments 

In this section we present experimental results for the
NYUD2 dataset where we use color and depth images as
the paired modalities, and on the JHMDB video dataset for
which we use the RGB and optical flow frames as the two
modalities.

Our general experimental framework consists of two
steps. The first step is supervision transfer as proposed in
Section 3, and the second step is to assess the quality of
the transferred representation by using it for a downstream
task. For both of the datasets we study, we consider the
domain of RGB images as $\mathcal{M}_s$, for which there is a
large dataset of labeled images $D_s$ in the form of ImageNet
[6], and treat depth and optical flow respectively as
$\mathcal{M}_d$. These choices for $\mathcal{M}_s$ and
$\mathcal{M}_d$ are of particular practical significance,
given the lack of large labeled datasets for depth images
and optical flow and, at the same time, the abundant
availability of paired images coming from RGB-D sensors
(for example Microsoft Kinect) and videos on the Internet
respectively.

For our layered image representation models, we use
convolutional neural networks (CNNs) [33, 31]. These
networks have been shown to be very effective for a variety
of image understanding tasks [7]. We experiment with the
network architectures from Krizhevsky et al. [31] (denoted
AlexNet) and Simonyan and Zisserman [44] (denoted VGG),
and use the models pre-trained on ImageNet [6] from the
Caffe [28] Model Zoo.

We use an architecture similar to [31] for the layered
representations for depth and flow images. We do this in order
to be able to compare to past works which learn features on
depth and flow images [17, 14]. Validating different CNN
architectures for depth and flow images is a worthwhile
scientific endeavor, which has not been undertaken so far,
primarily because of the lack of large scale labeled datasets
for these modalities. Our work here provides a method to
circumvent the need for a large labeled dataset for these and
other image modalities, and will naturally enable exploring
this question in the future; however, we do not delve into
this question in the current work.

We next describe our design choices for which layers to
transfer supervision between, and the specification of the
loss function $f$ and the transformation function $t$. We
experimented with what layer to use for transferring
supervision, and found transfer at mid-level layers works
best, and use the last convolutional layer pool5 for all
experiments in the paper. Such a choice also resonates well
with observations from [34] that lower layers in CNNs are
modality specific (and thus harder to transfer across
modalities) and visualizations from [13] that neurons in
mid-level layers are semantic and respond to parts of
objects. Transferring at pool5 also has the computational
benefit that training can be efficiently done in a fully
convolutional manner over the whole image.

Left: Does supervision transfer work?
  Exp. 1A  no init                                            22.7
  Exp. 1B  copy from RGB                                      25.1
  Exp. 1C  supervision transfer, AlexNet → AlexNet            29.7
  Exp. 1D  supervision transfer, AlexNet* → AlexNet           30.5

Middle: How good is the transferred representation by itself?
  Exp. 2A  copy from RGB (ft fc only)                         19.8
  Exp. 2B  supervision transfer (ft fc only), AlexNet* → AlexNet  30.0
  Exp. 2C  supervision transfer (ft fc only), VGG* → AlexNet  32.2
  Exp. 2D  supervision transfer, VGG* → AlexNet               33.6

Right: Are the representations complementary?
  Exp. 3A  [RGB]: RGB network on RGB images, AlexNet          22.3
  Exp. 3B  [RGB] + copy from RGB                              33.8
  Exp. 3C  [RGB] + supervision transfer, AlexNet* → AlexNet   35.6
  Exp. 3D  [RGB] + supervision transfer, VGG* → AlexNet       37.0

Table 1: We evaluate different aspects of our supervision transfer scheme on the object detection task on the NYUD2 val set using the
mAP metric. Left column demonstrates that our scheme for pre-training is better than alternatives like no pre-training, and copying over
weights from RGB networks. The middle column demonstrates that our technique leads to transfer of mid-level semantic features which by
themselves are highly discriminative, and that improving the quality of the supervisory network translates to improvements in the learned
features. Finally, the right column demonstrates that the learned features on the depth images are still complementary to the features on the
RGB image they were supervised with.

For the function $f$, we use L2 distance between the
feature vectors, $f(x, y) = \|x - y\|_2^2$. We also tried
$f(x, y) = \mathbb{1}(y > \tau) \cdot \log p(x) + \mathbb{1}(y \le \tau) \cdot \log(1 - p(x))$
(where $p(x) = \frac{e^{\alpha x}}{1 + e^{\alpha x}}$ and
$\mathbb{1}$ is the indicator function), for some reasonable
choices of $\alpha$ and $\tau$, but that resulted in worse
performance in initial experiments and we did not experiment
with it further.
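The two candidate loss functions can be written out as a short sketch. Several things here are assumptions rather than statements of the text: the L2 loss is taken to be the squared distance, `p` is taken to be a logistic sigmoid with scale `alpha`, the second expression is summed over feature dimensions, and its sign is kept exactly as written (a log-likelihood-style quantity). The code is not numerically hardened for extreme inputs.

```python
import math
from typing import Sequence


def squared_l2(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared L2 distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def sigmoid(z: float) -> float:
    """Logistic sigmoid (assumed form of p)."""
    return 1.0 / (1.0 + math.exp(-z))


def thresholded_log_term(
    x: Sequence[float], y: Sequence[float], alpha: float, tau: float
) -> float:
    """Sum over dimensions of 1(y > tau) * log p(x) + 1(y <= tau) * log(1 - p(x)).

    The target y is binarized at tau; p(x) = sigmoid(alpha * x) is an assumption.
    """
    total = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(alpha * xi)
        total += math.log(p) if yi > tau else math.log(1.0 - p)
    return total


l2_value = squared_l2([1.0, 2.0], [0.0, 4.0])
log_value = thresholded_log_term([0.0], [1.0], alpha=1.0, tau=0.5)
```

The first function penalizes any deviation from the source activations, while the second only cares about which side of the threshold each source activation falls on.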

Finally, the choice of the function $t$ varies with
different pairs of networks. As noted above, we train using a
fully convolutional architecture. This requires the spatial
resolution of the two layers $i^*$ in $\Phi$ and $L$ in $\Psi$
to be similar, which is trivially true if the architectures
$\Phi$ and $\Psi$ are the same. When they are not (for example
when we transfer from VGG net to AlexNet), we adjust the
padding in the AlexNet to obtain the same spatial resolution
at the pool5 layer.

In addition, we introduce an adaptation layer comprising
1 × 1 convolutions followed by ReLU to map from the
representation at layer $L$ in $\Psi$ to layer $i^*$ in
$\Phi$. This accounts for the difference in the number of
neurons (for example when adapting from VGG to AlexNet), and
even when the number of neurons is the same, allows for
domain specific fitting. For VGG to AlexNet transfer we also
needed to introduce a scaling layer to make the average norm
of features comparable between the two networks.
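A minimal sketch of such an adaptation layer follows, assuming a feature map stored as nested lists of shape height × width × channels. A 1 × 1 convolution is then just the same small matrix applied at every spatial location. The scaling helper is one hypothetical way to make average feature norms comparable; the text does not specify how the scaling layer is implemented.

```python
import math
from typing import List, Sequence

FeatureMap = List[List[List[float]]]  # height x width x channels


def conv1x1_relu(
    fmap: FeatureMap, weights: Sequence[Sequence[float]], bias: Sequence[float]
) -> FeatureMap:
    """Apply a 1x1 convolution (weights: C_out x C_in) followed by ReLU."""
    return [
        [
            [max(0.0, b + sum(w * v for w, v in zip(row, pixel)))
             for row, b in zip(weights, bias)]
            for pixel in line
        ]
        for line in fmap
    ]


def average_norm(fmap: FeatureMap) -> float:
    """Mean L2 norm of the per-location feature vectors."""
    norms = [math.sqrt(sum(v * v for v in pixel)) for line in fmap for pixel in line]
    return sum(norms) / len(norms)


def scale_to_match(fmap: FeatureMap, target_average_norm: float) -> FeatureMap:
    """Rescale features so their average norm equals the target (assumed scheme)."""
    s = target_average_norm / average_norm(fmap)
    return [[[s * v for v in pixel] for pixel in line] for line in fmap]


# Toy example: map 2 input channels to 3 output channels.
features = [[[1.0, 2.0], [3.0, -4.0]]]
adapted = conv1x1_relu(features, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
rescaled = scale_to_match([[[3.0, 4.0]]], 10.0)
```

Because the weights are shared across locations, the layer changes the channel count without touching the spatial resolution, which is why the resolutions of the two networks have to be aligned separately.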

4.1. Transfer to Depth Images 

We first demonstrate how we transfer supervision from
color images to depth images as obtained from a range
sensor like the Microsoft Kinect. As described above, we do
this set of experiments on the NYUD2 dataset [41] and show
results on the task of object detection and instance
segmentation [17]. The NYUD2 dataset consists of 1449 paired
RGB and D images. These images come from 464 different
scenes and were hand selected from the full video sequence
while ensuring diverse scene content [41]. The full
video sequence that comes with the dataset has over 400K
RGB-D frames; we use 10K of these frame pairs for
supervision transfer.

In all our experiments we report numbers on the standard
val and test splits that come with the dataset [41, 17].
Images in these splits have been selected while ensuring that
all frames belonging to the same scene are contained
entirely in exactly one split. We additionally made sure only
frames from the corresponding training split were used for
supervision transfer.
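The split hygiene described above can be sketched as a small selection routine. Everything about the data layout is hypothetical: frames are assumed to be `(scene_id, rgb_path, depth_path)` tuples, `split_of_scene` maps a scene to its split name, and uniform random sampling is an assumption, since the text does not say how the frame pairs were chosen.

```python
import random
from typing import Dict, List, Tuple

Frame = Tuple[str, str, str]  # (scene_id, rgb_path, depth_path), hypothetical layout


def select_transfer_pairs(
    frames: List[Frame],
    split_of_scene: Dict[str, str],
    num_pairs: int,
    seed: int = 0,
) -> List[Frame]:
    """Keep only frames whose scene is in the training split, then subsample."""
    eligible = [f for f in frames if split_of_scene.get(f[0]) == "train"]
    if len(eligible) <= num_pairs:
        return eligible
    # Assumed sampling scheme: uniform without replacement, seeded for repeatability.
    return random.Random(seed).sample(eligible, num_pairs)
```

Filtering by scene rather than by frame matters because neighbouring video frames are near-duplicates; a frame-level split would leak val or test scenes into the transfer set.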

The downstream task that we study here is that of object 
detection. We follow the experimental setup from Gupta et 
al. [17] for object detection and study the 19 category object 
detection problem, and use mean average precision (mAP) 
to measure performance. 

Baseline Detection Model We use the model from
Gupta et al. [17] for object detection. Their method builds
off R-CNN [13]. In our initial experiments we adapted their
model to the more recent Fast R-CNN framework [12]. We
summarize our key findings here. First, [17] trained the
final detector on both RGB and D features jointly. We found
that training independent models all the way and then simply
averaging the class scores before the SoftMax performed
better. While this is counter-intuitive, we feel it is
plausible given the limited amount of training data. Second,
[17] use features from the fc6 layer and observed worse
performance when using the fc7 representation; in our
framework where we are training completely independent
detectors for the two modalities, using the fc7
representation is better than using the fc6 representation.
Finally, using bounding box regression boosts performance.
Here we simply average the predicted regression target from
the detectors on the two modalities. All this analysis helped
us boost the mean AP on the test set from 38.80% as reported
by [17, 16] to 44.39%, using the same CNN network and
supervision. This already is the state-of-the-art result on
this dataset and we use this as a baseline for the rest of
our experiments. We denote this model as ‘[17] + Fast R-CNN’.
We followed the default setup for training Fast R-CNN, 40K
iterations, base learning rate of 0.001 and stepping it down
by a factor of 10 after every 30K iterations, except that we
finetune all the layers, and use 688px length for the shorter
image side. We used RGB-D box proposals from [17] for all
experiments.

Figure 2: Visualization of learned filters (best viewed in color): (a) visualizes filters learned on RGB images from ImageNet data by
AlexNet. (b) shows these filters after the finetuning on HHA images, and hardly anything changes visually. (c) shows HHA image filters
from our pre-training scheme, which are much different from ones that are learned on RGB images. (d) shows HHA image filters learned
without any pre-training. (e) shows optical flow filters learned by [14]. Note that they initialize these filters from RGB filters and these
also do not change much over their initial RGB filters. (f) shows filters we learn on optical flow images, which are again very different
from filters learned on RGB or HHA images. (g) shows image patches corresponding to highest scoring activations for two neurons in
the RGB CNN. (h) shows HHA image patches corresponding to highest scoring activations of the same neuron in the supervision transfer
depth CNN. (i) shows the corresponding RGB image patch for these depth image patches for ease of visualization.
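The step schedule quoted for the baseline (base learning rate 0.001, reduced by a factor of 10 after every 30K iterations, 40K iterations in total) can be written as a one-line function. This is a sketch of the arithmetic only; the function name and parameter names are hypothetical and not taken from any particular framework.

```python
def step_learning_rate(
    iteration: int,
    base_lr: float = 0.001,
    gamma: float = 0.1,
    step_size: int = 30_000,
) -> float:
    """Learning rate after `iteration` steps under a step-decay schedule."""
    return base_lr * gamma ** (iteration // step_size)


# Learning rate at a few points of a 40K-iteration run.
schedule = {it: step_learning_rate(it) for it in (0, 29_999, 30_000, 39_999)}
```

With these settings the rate stays at 0.001 for the first 30K iterations and drops to about 0.0001 for the final 10K.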

Note that Gupta et al. [17] embed depth images into a
geocentric embedding which they call HHA (HHA encodes
horizontal disparity, height above ground and angle with
gravity) and use the AlexNet architecture to learn HHA
features and copy over the weights from the RGB CNN that was
trained for 1000 way classification [31] on ImageNet [6] to
initialize this network. All through this paper, we stick with
using HHA embedding^1 to represent the input depth images,
and their network architecture, and show how our proposed
supervision transfer scheme improves performance over
their technique for initialization. We summarize our
various transfer experiments below:

Does supervision transfer work? The first question we 
investigate is if we are able to transfer supervision to a new 
modality. To understand this we conducted the following 
three experiments: 

1. no init (1A): randomly initialize the depth network
using weight distributions typically used for training on
ImageNet and simply train this network for the final task.
While training this network we train for 100K iterations,
start with a learning rate of 0.01 and step it down by a
factor of 10 every 30K iterations.

^1 We use the terms depth and HHA interchangeably.

2. copy from RGB (1B): copy weights from an RGB
network that was trained on ImageNet. This is the same as the
scheme proposed in [17]. This network is then trained
using the standard Fast R-CNN settings.

3. supervision transfer (1C): train layers conv1 through
pool5 from random initialization using the supervision
transfer scheme as proposed in Section 3, on the 5K paired
RGB and D images from the video sequence from NYUD2
for scenes contained in the training set. We then plug in
these trained layers along with randomly initialized fc6,
fc7 and classifier layers for training with Fast R-CNN.

We report the results in Table 1. We see that ‘copy from RGB’
surprisingly does better than ‘no init’, which is consistent
with what Gupta et al. report in [17], but our scheme for
supervision transfer outperforms both these baselines by a
large margin, pushing up mean AP from 25.1% to 29.7%.
We also experimented with using an RGB network that
has been adapted for object detection on this dataset for
supervising the transfer (1D) and found that this boosted
performance further from 29.7% to 30.5% (1D in Table 1,
AlexNet* indicates RGB AlexNet that has been adapted for
detection on the dataset). We use this scheme for all
subsequent experiments.

Visualizations. We visualize the filters from the first
layer for these different schemes of transfer in Figure 2(a-f),
and observe that our training scheme learns reasonable
filters and find that these filters are of a different nature
than filters learned on RGB images. In contrast, note that
for schemes which initialize depth CNNs with RGB CNN weights,
filters in the first layer change very little. We also
visualize patches giving high activations for neurons paired
across RGB and D images (Figure 2(g-i)). High scoring patches
from the RGB CNN (AlexNet in this case) correspond to parts
of objects (g), and high scoring patches from the depth CNN
also correspond to parts of the same object class (h and i).

val     AP^r at 0.5              AP^r at 0.7
        fc7    +pool2+conv4      fc7    +pool2+conv4
RGB     26.3   29.8              14.8   18.3
D       28.4   31.5              17.4   19.6

Table 2: Region detection average precision AP^r on NYUD2
val set: Performance on NYUD2 val set where we observe similar
boosts in performance when using hyper-column transform with
our learned feature hierarchies (learned using supervision
transfer on depth images) as obtained with more standard
feature hierarchies learned on ImageNet on RGB images.

How good is the transferred representation by itself? 
The next question we ask is if our supervision transfer 
scheme transfers good representations or does it only pro¬ 
vide a good initialization for feature learning. To answer 
this question, we conducted the following experiments: 

1. Quality of transferred pool5 representation (2A,
2B): The first experiment was to evaluate the quality of the
transferred pool5 representation. To do this, we froze the
network parameters for layers conv1 through pool5 to be
those learned during the transfer process, and only learn
parameters in fc6, fc7 and classifier layers during Fast
R-CNN training (2B ‘supervision transfer (ft fc only)’).
We see that there is only a moderate degradation in
performance for our learned features from 30.5% (1D) to 30.0%
(2B), indicating that the features learned on depth images at
pool5 are discriminative by themselves. In contrast, when
freezing weights when copying from ImageNet (2A),
performance degrades significantly to 19.8%.

2. Improved transfer using better supervising
network^2 (2C, 2D): The second experiment investigated if
performance improves as we improve the quality of the
supervising network. To do this, we transferred supervision
from VGG net instead of AlexNet (2C). VGG net has been
shown to be better than AlexNet for a variety of vision tasks.
As before we report performance when freezing parameters
till pool5 (2C), and learning all the way (2D). We see that
using a better supervising net results in learning better
features for depth images: when the representation is frozen
till pool5 we see performance improves from 30.0% to 32.2%,
and when we finetune all the layers performance goes up to
33.6% as compared to 30.5% for AlexNet.

^2 To transfer from VGG to AlexNet, we use 150K transfer iterations
instead of 100K. Running longer helps for VGG to AlexNet transfer by
1.5% and much less (about 0.5%) for AlexNet to AlexNet transfer.
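The difference between the ‘ft fc only’ runs and full finetuning comes down to which layers receive updates. The sketch below makes that selection explicit for an AlexNet-style layer list. The layer names `cls_score` and `bbox_pred` are hypothetical stand-ins for the classifier and box-regression layers, and a real framework would express freezing through its own mechanism (for example a per-layer learning rate of zero).

```python
from typing import List, Sequence, Tuple

# AlexNet-style layer order; the last two names are hypothetical.
ALEXNET_LAYERS = [
    "conv1", "pool1", "conv2", "pool2", "conv3", "conv4", "conv5", "pool5",
    "fc6", "fc7", "cls_score", "bbox_pred",
]


def trainable_layers(
    layer_names: Sequence[str],
    finetune_fc_only: bool,
    frozen_prefixes: Tuple[str, ...] = ("conv", "pool"),
) -> List[str]:
    """Return the layers that are updated during detector training."""
    if not finetune_fc_only:
        return list(layer_names)  # learn all the way, as in Exp. 1D and 2D
    # Freeze conv1 through pool5, as in Exp. 2A, 2B and 2C.
    return [name for name in layer_names if not name.startswith(frozen_prefixes)]


fc_only = trainable_layers(ALEXNET_LAYERS, finetune_fc_only=True)
everything = trainable_layers(ALEXNET_LAYERS, finetune_fc_only=False)
```

Comparing a frozen run with its fully finetuned counterpart (30.0% vs 30.5% for AlexNet*, 32.2% vs 33.6% for VGG*) isolates how much of the final accuracy is already present in the transferred features.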


test                          modality   RGB Arch.   D Arch.   AP^r at 0.5   AP^r at 0.7
[20]                          RGB        AlexNet     -         23.4          13.4
Gupta et al. [16]             RGB + D    AlexNet     AlexNet   37.5          21.8
Our (supervision transfer)    RGB + D    AlexNet     AlexNet   40.5          25.4
[20]                          RGB        VGG         -         31.0          17.7
Our (supervision transfer)    RGB + D    VGG         AlexNet   42.1          26.9

Table 3: Region detection average precision AP^r on NYUD2
test set.


Is the learned representation complementary to the
representation on the source modality? The next question
we ask is if the representation learned on the depth images
is complementary to the representation on the RGB images
from which it was learned. To answer this question
we look at the performance when using both the modalities
together. We do this the same way that we describe for
the baseline model and simply average the category scores
and regression targets from the RGB and D detectors.
Table 1 (right) reports our findings. Just using RGB images
(3A) gives us a performance of 22.3%. Combining this
with the HHA network as initialized using the scheme from
Gupta et al. [17] (3B) boosts performance to 33.8%.
Initializing the HHA network using our proposed supervision
transfer scheme when transferring from AlexNet* to AlexNet
(3C) gives us 35.6% and when transferring from VGG* to
AlexNet (3D) gives us 37.0%. These results show that the
representations are still complementary and using the two
together can help the final performance.
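The combination rule used for the two detectors is simple enough to sketch directly: average the per-class scores before the softmax and average the predicted box regression targets. The function below handles a single box proposal and assumes plain Python lists; the names are hypothetical and the layout of the regression targets (four numbers per box here) is an assumption.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def fuse_detections(
    rgb_scores: Sequence[float],
    depth_scores: Sequence[float],
    rgb_box_targets: Sequence[float],
    depth_box_targets: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Average pre-softmax class scores and regression targets of two detectors."""
    avg_scores = [(a + b) / 2.0 for a, b in zip(rgb_scores, depth_scores)]
    avg_targets = [(a + b) / 2.0 for a, b in zip(rgb_box_targets, depth_box_targets)]
    return softmax(avg_scores), avg_targets


probs, box = fuse_detections(
    [2.0, 0.0], [0.0, 2.0],
    [0.2, 0.4, 0.0, 0.0], [0.0, 0.0, 0.2, 0.4],
)
```

Averaging before the softmax treats the two detectors as equally trusted; in the toy call above the detectors disagree symmetrically, so the fused class probabilities come out uniform.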

Does supervision transfer lead to meaningful
intermediate layer representations? The next question we
investigate is if the intermediate layers learned in the
target domain $\mathcal{M}_d$ through supervision transfer
carry useful information. [30] hypothesize that information
from intermediate layers in such hierarchies may be useful
for fine grained tasks. Jones and Malik [29], Weber and Malik
[51], and in more recent work Hariharan et al. [20] and Long
et al. [36] operationalize this and demonstrate improvements
for fine grained tasks like correspondence estimation and
segmentation. Here we investigate if the representations
learned using supervision transfer also share this property.
To test this, we follow the hyper-column architecture from
Hariharan et al. [20] and study the task of simultaneous
detection and segmentation (SDS) [19] and investigate if the
use of hyper-columns with our trained networks results in
similar improvements as obtained when using more
traditionally trained CNNs. We report the results in Table 2.
On the NYUD2 dataset, the hyper-column transform improves
AP^r from 26.3% to 29.8% when using AlexNet for RGB images.
We follow the same experimental setup as proposed in [18],
and fix the CNN parameters (to a network that was finetuned
for detection on the NYUD2 dataset) and only learn the
classifier parameters and use features from pool2 and conv4
layers in addition to fc7 for figure ground prediction. When
doing the same for our supervision transfer network we
observe a similar boost in performance from 28.4% to 31.5%
when using the hyper-column transform. This indicates that
models trained using supervision transfer not only learn good
representations at the point of supervision transfer (pool5
in this case), but also in the intermediate layers of the
network.

pool1   pool2   conv3   conv4   pool5   fc6    fc7    conv3+fc7
24.4    28.4    30.6    29.9    30.5    29.7   27.7   31.3

Table 4: Mean AP on NYUD2 val set as a function of layer
used for supervision transfer.

How does performance vary as the transfer point is changed? We now
study how performance varies as we vary the layer used for supervision
transfer. We keep the same experimental setup as used for Exp. 1D in
Table 1, and conduct supervision transfer at different layers of the
network. Layers above the transfer point are initialized randomly and
learned during detector training. For transferring features from layers
1 to 5, we use fully convolutional training as before. When transferring
fc6 and fc7 features, we instead compute them over bounding box
proposals (we use RGB-D MCG bounding box proposals [17]) using Spatial
Pyramid Pooling on conv5 [21, 12].

We report the obtained AP on the NYUD2 val set in Table 4. Performance
is poor when transferring at lower layers (pool1 and pool2). Transfer at
layers conv3, conv4 and pool5 works comparably, but performance
deteriorates when moving to higher layers (fc6 and fc7). This validates
our choice of an intermediate layer as the transfer point. We believe
the drop in performance at higher layers is an artifact of the amount of
data used for supervision transfer. With a richer and more diverse
dataset of paired images we expect transfer at higher layers to work as
well as or better than transfer at mid-layers.

We also conducted some initial experiments with multiple transfer
points. When transferring at conv3 and fc7 together, performance
improves over transferring at either layer alone (31.3% in Table 4),
indicating that learning is facilitated when supervision is closer to
the parameters being learned. We defer exploration of other choices in
this space to future work.
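To make the notion of a transfer point concrete, the following minimal sketch computes a supervision transfer loss at one or at several layers. It assumes a mean squared error between paired RGB and depth features and an equal weighting of the layers; the function name `transfer_loss`, the dictionary layout and the toy feature values are illustrative assumptions, not the released implementation.

```python
def transfer_loss(source_feats, target_feats, layers):
    """Sum, over the chosen transfer layers, of the mean squared difference
    between source (e.g. RGB) and target (e.g. depth) features.

    Assumption: features are flattened into equal-length lists of floats and
    every layer contributes with weight 1.
    """
    total = 0.0
    for name in layers:
        src, tgt = source_feats[name], target_feats[name]
        if len(src) != len(tgt):
            raise ValueError("feature size mismatch at layer " + name)
        total += sum((s - t) ** 2 for s, t in zip(src, tgt)) / len(src)
    return total


# Toy features for one paired RGB / depth image (hypothetical values).
rgb_feats = {"conv3": [1.0, 2.0, 3.0], "fc7": [0.5, 0.5]}
depth_feats = {"conv3": [1.0, 2.0, 5.0], "fc7": [0.0, 1.0]}

single_point = transfer_loss(rgb_feats, depth_feats, ["conv3"])
multi_point = transfer_loss(rgb_feats, depth_feats, ["conv3", "fc7"])
print(single_point, multi_point)
```

With several transfer points the loss simply accumulates one term per layer, so layers close to each supervised point receive a more direct training signal.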

Is input representation in the form of HHA images still important?
Given our tool for training CNNs on depth images, we can now investigate
whether hand engineering the input representation is still important.
We conduct an experiment in exactly the same setting as Exp. 1D, except
that we work with disparity images (replicated to have 3 channels)
instead of HHA images. This gives a mAP of 29.2%, as compared to 30.5%
for the HHA images. The difference in performance is smaller than what
[17] reports but still exists, which suggests that encoding information
into geocentric channels through the HHA embedding is still useful [17].


          Train on MS COCO and adapt to NYUD2 using supervision transfer    Train on NYUD2
          bed     chair   sink    sofa    table   tv      toilet  mean      mean
RGB       51.6    26.6    25.1    43.1    14.4    12.9    57.5    33.0      35.7
D         59.4    27.1    23.8    32.2    13.0    13.6    43.8    30.4      45.0
RGB + D   60.2    35.3    27.5    48.2    16.5    17.1    58.1    37.6      54.4

Table 5: Adapting RGB object detectors to RGB-D images: We transfer
object detectors trained on RGB images (on the MS COCO dataset) to RGB-D
images in the NYUD2 dataset, without using any annotations on depth
images. We do this by learning a model on depth images using supervision
transfer and then use the RGB object detector trained on the
representation learned on depth images. We report detection AP(%) on the
NYUD2 test set. These transferred detectors work well on depth images
even without using any annotations on depth images. Combining
predictions from the RGB and depth image improves performance further.

Applications to zero-shot detection on depth images. Supervision
transfer can be used to transfer detectors trained on RGB images to
depth images. We do this in the following steps. We first train
detectors on RGB images. We then split the network at an appropriate
mid-level point to obtain two networks: a lower, domain specific part
and an upper, more semantic part. We then use the lower part of the RGB
network as the supervisor to train a network on depth images to generate
the same representation as the lower RGB network. This is done using the
same supervision transfer procedure as before on a set of unlabeled
paired RGB-D images. We then construct a 'franken' network with the
lower domain specific part coming from the depth network and the upper
more semantic part coming from the RGB network. We then simply use the
output of this franken network on depth images to obtain zero-shot
object detection output.
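The construction above amounts to function composition, which the following toy sketch illustrates. All networks here are hypothetical stand-ins written as plain Python functions (`rgb_lower`, `rgb_upper`, `depth_lower`); in practice the depth network would be a CNN trained by supervision transfer to reproduce the lower RGB features on paired images.

```python
def compose(lower, upper):
    """Build a network that feeds the output of `lower` into `upper`."""
    def network(image):
        return upper(lower(image))
    return network


# Hypothetical stand-ins for the two halves of a trained RGB detector.
def rgb_lower(image):
    return [2.0 * value for value in image]   # domain specific features


def rgb_upper(features):
    return sum(features)                      # more semantic scoring


# Hypothetical depth network, assumed to have been trained so that its
# output on a depth image matches rgb_lower on the paired RGB image.
def depth_lower(image):
    return [value + 1.0 for value in image]


rgb_network = compose(rgb_lower, rgb_upper)
franken_network = compose(depth_lower, rgb_upper)

paired_rgb = [1.0, 2.0, 3.0]
paired_depth = [1.0, 3.0, 5.0]
print(rgb_network(paired_rgb), franken_network(paired_depth))
```

Because the depth network reproduces the mid-level RGB features on the paired input, the upper RGB part can be reused unchanged, and no depth annotations are needed.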

More specifically, we use Fast R-CNN with the AlexNet CNN to train
object detectors on the MS COCO dataset [35]. We then split the network
right after the convolutional layers (at pool5), and train a network on
depth images to predict the same pool5 features as this network on
unlabeled RGB-D images from the NYUD2 dataset (using frames from the
trainval video sequences). We study all 7 object categories that are
shared between the MS COCO and NYUD2 datasets, and report the
performance in Table 5. Our zero-shot scheme for transferring detectors
across modalities works well: while the RGB detector trained on MS COCO
obtains a mean AP of 33.0% on these categories, our zero-shot detector
on D images performs comparably, with a mean AP of 30.4%. Note that in
doing so we have not used any annotations from the NYUD2 dataset (RGB or
D images). Furthermore, combining predictions from the RGB and D object
detectors gives a boost over using the detector on the RGB image alone,
for a performance of 37.6%. Performance when training detectors using
annotations from the NYUD2 dataset (last column in Table 5) is, as
expected, much higher.
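As a quick consistency check, the mean AP values quoted above can be recomputed from the per-category numbers in Table 5. The sketch below assumes the mean is the unweighted average over the 7 shared categories; the variable names are illustrative.

```python
# Per-category AP(%) from Table 5, in the order:
# bed, chair, sink, sofa, table, tv, toilet.
per_category_ap = {
    "RGB": [51.6, 26.6, 25.1, 43.1, 14.4, 12.9, 57.5],
    "D": [59.4, 27.1, 23.8, 32.2, 13.0, 13.6, 43.8],
    "RGB + D": [60.2, 35.3, 27.5, 48.2, 16.5, 17.1, 58.1],
}

# Unweighted mean over categories, rounded to one decimal as in the table.
mean_ap = {
    name: round(sum(values) / len(values), 1)
    for name, values in per_category_ap.items()
}
print(mean_ap)
```

The recomputed values agree with the 33.0%, 30.4% and 37.6% reported in the text.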

Performance on test set. Finally, we report the performance of our best
performing supervision transfer scheme (VGG* → AlexNet) on the test set
in Table 6. When used with AlexNet for obtaining color features, we
obtain a final performance of 47.1%, which is 2.7 points higher than the
current state-of-the-art on this task (Gupta et al. [17] + Fast R-CNN,
44.4%). We see similar improvements when using VGG for obtaining color
features (46.2% to 49.1%). The improvement when using just the depth
image is much larger: 41.7% for our final model as compared to 34.2% for
the baseline model, which amounts to a 22% relative improvement. Note
that in obtaining these performance improvements we are using exactly
the same CNN architecture and amount of labeled data. We also report
performance on the SDS task in Table 3 and obtain state-of-the-art
performance of 40.5% as compared to the previous best of 37.5% [16] when
using AlexNet; using the VGG CNN for the RGB image improves performance
further to 42.1%.

method                          modality RGB Arch.  D Arch.  mean
Fast R-CNN [12]                 RGB      AlexNet    -        27.8
Fast R-CNN [12]                 RGB      VGG        -        38.8
Gupta et al. [17]               RGB + D  AlexNet    AlexNet  38.8
Gupta et al. [16]               RGB + D  AlexNet    AlexNet  41.2
Gupta et al. [17] + Fast R-CNN  RGB + D  AlexNet    AlexNet  44.4
Ours (supervision transfer)     RGB + D  AlexNet    AlexNet  47.1
Gupta et al. [17] + Fast R-CNN  RGB + D  VGG        AlexNet  46.2
Ours (supervision transfer)     RGB + D  VGG        AlexNet  49.1
Gupta et al. [17] + Fast R-CNN  D        -          AlexNet  34.2
Ours (supervision transfer)     D        -          AlexNet  41.7

Table 6: Object detection mean AP(%) on NYUD2 test set: We compare our
performance against several state-of-the-art methods. RGB Arch. and
D Arch. refer to the CNN architecture used by the detector. When using
just the depth image, our method improves performance from 34.2% to
41.7%. When used in addition to features from the RGB image, our learned
features improve performance from 44.4% to 47.1% (when using AlexNet RGB
features) and from 46.2% to 49.1% (when using VGG RGB features) over
past methods for learning features from depth images. We see
improvements across almost all categories; performance on individual
categories is tabulated in the supplementary material.
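The absolute and relative gains quoted above follow directly from the mean AP values in Table 6. A short worked computation, with illustrative helper names:

```python
def absolute_gain(baseline, improved):
    """Gain in AP percentage points."""
    return improved - baseline


def relative_gain(baseline, improved):
    """Gain as a percentage of the baseline AP."""
    return 100.0 * (improved - baseline) / baseline


# Mean AP(%) on the NYUD2 test set, taken from Table 6.
rgbd_alexnet_gain = absolute_gain(44.4, 47.1)   # RGB + D, AlexNet color features
rgbd_vgg_gain = absolute_gain(46.2, 49.1)       # RGB + D, VGG color features
depth_only_relative = relative_gain(34.2, 41.7)  # depth image only

print(round(rgbd_alexnet_gain, 1), round(rgbd_vgg_gain, 1),
      round(depth_only_relative, 1))
```

The depth-only result is an improvement of 7.5 points on a baseline of 34.2, which is where the figure of roughly 22% comes from.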

Training Time. Finally, we report the amount of time it takes to learn
a model using supervision transfer. For AlexNet to AlexNet supervision
transfer we trained for 100K iterations, which took a total of 2.5 hours
on an NVIDIA K40 GPU. This is many orders of magnitude faster than
training models from random initialization on ImageNet scale data using
class labels.

4.2. Transfer to Flow Images 

We now report our experiments for transferring supervision to optical
flow images. We consider the end task of action detection on the JHMDB
dataset. The task is to detect people doing actions like catch, clap,
pick, run, sit in frames of a video. Performance is measured in terms of
mean average precision, as in the standard PASCAL VOC object detection
task and as in the NYUD2 experiments in Section 4.1.

modality      RGB           RGB           optical flow  optical flow  optical flow  optical flow
method        [14]          [14] + [12]   [14]          [14] + [12]   Random Init   Ours
pre-training  -             -             Sup PreTr     Sup PreTr     No PreTr      Sup Transfer
mean AP       27.0          32.0          24.3          38.4          31.7          35.7

Table 7: Action Detection AP(%) on the JHMDB test set: We report action
detection performance on the test set of JHMDB using RGB or flow images.
The right part of the table compares our supervision transfer method
against the baseline of random initialization and against the ceiling
given by the fully supervised pre-training method from [14]. Our method
reaches more than half the way towards fully supervised pre-training.

A popular technique for getting better performance at such tasks on
video data is to additionally use features computed on the optical flow
between the current frame and the next frame [43, 14], and we use our
supervision transfer scheme to learn features for optical flow images in
this context.

Detection model. For JHMDB we use the experimental setup from Gkioxari
and Malik [14] and study the 21 class task. Here again, Gkioxari and
Malik build off of R-CNN, so we first adapt their system to use Fast
R-CNN, and observe similar boosts in performance as for NYUD2 when going
from the R-CNN to the Fast R-CNN framework (Table 7; the full table with
per class performance is in the supplementary material). We denote this
model as [14] + [12]. We attribute this large difference in performance
to a) bounding box regression and b) the number of iterations used for
training.

Supervision transfer performance. We use the videos from the UCF 101
dataset [47] for our pre-training. Note that we do not use any labels
provided with the UCF 101 dataset, and simply use the videos as a source
of paired RGB and flow images. We take 5 frames from each of the 9K
videos in the train1 set. We report performance on the JHMDB test set in
Table 7. Note that JHMDB has 3 splits and, as in past work, we report
the AP averaged across these 3 splits.

We report performance for three different schemes for initializing the
flow model: a) random initialization (Random Init, No PreTr), where the
flow network is initialized randomly using the weight initialization
scheme used for training an RGB model on ImageNet; b) supervised
pre-training ([14] + [12], Sup PreTr) on flow images from UCF 101 for
the task of video classification, starting from RGB weights as done by
Gkioxari and Malik [14]; and c) supervision transfer (Ours, Sup
Transfer) from an RGB model to train the optical flow model as per our
proposed method. Our scheme for supervision transfer improves
performance from the 31.7% achieved with random initialization to 35.7%,
which is more than half way towards what fully supervised pre-training
can achieve (38.4%), thereby illustrating the efficacy of our adaptation
scheme.
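The claim of being more than half way can be checked with a short computation on the Table 7 numbers. The helper name `fraction_of_gap_closed` is illustrative.

```python
def fraction_of_gap_closed(baseline, ours, ceiling):
    """Share of the baseline-to-ceiling gap recovered by our method."""
    return (ours - baseline) / (ceiling - baseline)


# Mean AP(%) on flow images from Table 7: random initialization,
# supervision transfer, and fully supervised pre-training.
gap_closed = fraction_of_gap_closed(baseline=31.7, ours=35.7, ceiling=38.4)
print(round(gap_closed, 3))
```

Supervision transfer gains 4.0 of the 6.7 points that separate random initialization from supervised pre-training, i.e. about 60% of the gap.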

Conclusion. We have presented an algorithm for transfer of learned
representations from a well labeled modality to new unlabeled modalities
using unlabeled paired images from the two modalities. This enables us
to learn rich representations on unlabeled modalities and obtain large
boosts in performance. We believe the advances presented in this paper
will allow us to effectively use new modalities for obtaining better
performance on standard vision tasks. Code, data and pretrained models
are available at
https://github.com/s-gupta/fast-rcnn/tree/distillation.
A. Supplementary Material

1. Per category average precision: We report per category numbers for
summary tables on test sets in the main paper.

2. Sample Detection and SDS output: We show sample detections and SDS
output for the categories we study. We sample 18 detections uniformly
from the top k (= 0.75 × number of instances) detections for each
category: bed (Figure 3), chair (Figure 4), sofa (Figure 5),
toilet (Figure 6), table (Figure 7).
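The sampling rule above can be sketched as follows. This is one possible reading: detections are assumed to be sorted by decreasing score, ranks are 0-based, and ranks are spread evenly over the top k with rounding to the nearest integer. The rounding rule and the function name `sample_detection_ranks` are assumptions, since the text does not specify them.

```python
def sample_detection_ranks(num_instances, num_samples=18, fraction=0.75):
    """Return evenly spaced 0-based ranks within the top k detections,
    where k = fraction * num_instances (truncated to an integer)."""
    k = int(fraction * num_instances)
    if k <= 0:
        return []
    if k <= num_samples:
        # Fewer candidates than requested samples: show all of them.
        return list(range(k))
    step = (k - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


# Example with a hypothetical category that has 200 ground truth instances.
ranks = sample_detection_ranks(200)
print(ranks)
```

Sampling across the whole top k, rather than showing only the highest scoring detections, exposes both confident and borderline outputs.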


B. Document Changelog 

v1 Initial version

v2 Major changes: additional discussion of multi-modal literature,
visualization of neural activations in Figure 2(g-i), additional
experiments about quality of intermediate layers, performance as a
function of transfer point, utility of HHA embedding over disparity
images, zero-shot detection on depth images. Minor edits all over the
text.
Figure 3: Sample detections and segmentation masks for bed on NYUD2 test set.

Figure 4: Sample detections and segmentation masks for chair on NYUD2 test set.

Figure 5: Sample detections and segmentation masks for sofa on NYUD2 test set.

Figure 6: Sample detections and segmentation masks for toilet on NYUD2 test set.

Figure 7: Sample detections and segmentation masks for table on NYUD2 test set.


metric: AP^r at 0.5

method                          modality RGB Arch.  D Arch.  bath    bed     book    box     chair   counter desk    door    dresser garbage lamp    monitor night   pillow  sink    sofa    table   tele-   toilet  mean
                                                             tub             shelf                                                   bin                     stand                                   vision
[20]                            RGB      AlexNet    -        8.9     45.2    12.6    1.4     20.6    24.3    4.2     19.3    27.2    20.1    26.2    39.2    21.3    23.7    27.6    25.2    8.2     35.3    54.3    23.4
Gupta et al. [16]               RGB + D  AlexNet    AlexNet  42.0    65.1    12.7    5.1     42.0    42.1    9.5     20.5    38.0    50.3    32.8    … 38.2 42.0 39.4 … 14.8 48.0 …
Ours (supervision transfer)     RGB + D  AlexNet    AlexNet  31.5    68.7    22.3    4.0     39.6    43.3    11.2    25.1    52.1    42.5    45.0    61.8    47.5    41.3    48.5    49.7    18.1    49.5    68.4    40.5
[20]                            RGB      VGG        -        … 42.2 32.9 34.3 …                                                                                                              9.6     44.1    62.5    31.0
Ours (supervision transfer)     RGB + D  VGG        AlexNet  …                                                                                                                                                       42.1

metric: AP^r at 0.7

method                          modality RGB Arch.  D Arch.  bath    bed     book    box     chair   counter desk    door    dresser garbage lamp    monitor night   pillow  sink    sofa    table   tele-   toilet  mean
                                                             tub             shelf                                                   bin                     stand                                   vision
[20]                            RGB      AlexNet    -        5.9     28.6    2.8     0.6     6.6     6.0     1.2     9.6     16.8    15.8    8.4     16.1    17.1    15.5    11.9    12.9    1.8     33.7    44.0    13.4
Gupta et al. [16]               RGB + D  AlexNet    AlexNet  13.8    46.0    2.4     3.0     17.3    15.0    2.6     9.9     25.8    45.4    6.9     37.5    24.3    25.5    19.6    27.9    7.6     44.9    38.7    21.8
Ours (supervision transfer)     RGB + D  AlexNet    AlexNet  13.3    50.6    5.3     1.3     15.9    14.2    2.6     15.6    50.0    34.0    14.0    36.4    33.8    26.3    20.8    27.7    6.9     44.9    68.4    25.4
[20]                            RGB      VGG        -        6.6     35.7    0.4     1.6     9.4     7.2     1.1     16.5    29.3    29.1    11.3    33.3    19.5    19.9    17.2    17.9    1.7     35.7    43.4    17.7
Ours (supervision transfer)     RGB + D  VGG        AlexNet  13.0    56.1    6.9     2.5     17.9    14.8    4.3     18.6    51.7    36.2    16.2    42.2    32.3    26.9    20.4    32.5    6.3     44.4    68.7    26.9

Table 8: Region Detection AP^r (%) on NYUD2 test set: We report per class AP^r for the SDS experiments in Table 3 in the main paper.
method                          modality RGB Arch.  D Arch.  bath    bed     book    box     chair   counter desk    door    dresser garbage lamp    monitor night   pillow  sink    sofa    table   tele-   toilet  mean
                                                             tub             shelf                                                   bin                     stand                                   vision
Fast R-CNN [12]                 RGB      AlexNet    -        7.9     51.2    37.0    1.5     31.3    35.4    9.4     22.4    28.9    19.3    31.0    35.9    24.1    26.4    24.6    39.7    16.6    32.9    53.5    27.8
Fast R-CNN [12]                 RGB      VGG        -        37.4    69.1    47.0    2.9     44.4    48.6    11.5    28.7    43.1    33.6    32.9    50.9    32.6    34.4    39.0    50.3    24.5    44.1    61.5    38.8
Gupta et al. [17]               RGB + D  AlexNet    AlexNet  36.4    70.8    35.1    3.6     47.3    46.8    14.9    23.3    38.6    43.9    37.6    52.7    40.7    42.4    43.5    51.6    22.0    38.0    47.7    38.8
Gupta et al. [16]               RGB + D  AlexNet    AlexNet  39.4    73.6    38.4    5.9     50.1    47.3    14.6    24.4    42.9    51.5    36.2    52.1    41.5    42.9    42.6    54.6    25.4    48.6    50.2    41.2
Gupta et al. [17] + Fast R-CNN  RGB + D  AlexNet    AlexNet  37.1    78.3    48.5    3.3     45.3    54.6    21.9    28.5    48.6    41.9    42.5    60.6    49.2    43.7    40.2    62.1    29.2    44.3    63.6    44.4
Ours (supervision transfer)     RGB + D  AlexNet    AlexNet  45.6    78.7    48.5    4.3     50.5    57.8    21.4    29.6    54.0    41.6    45.4    61.2    57.9    47.3    48.9    63.2    29.5    50.0    60.1    47.1
Gupta et al. [17] + Fast R-CNN  RGB + D  VGG        AlexNet  47.2    80.4    52.8    4.2     49.7    53.0    22.4    33.7    52.1    44.4    39.2    64.6    47.5    45.1    42.1    63.2    31.4    42.1    63.0    46.2
Ours (supervision transfer)     RGB + D  VGG        AlexNet  50.6    81.0    52.6    5.4     53.0    56.1    20.9    34.6    57.9    46.2    42.5    62.9    54.7    49.1    50.0    65.9    31.9    50.1    68.0    49.1
Gupta et al. [17] + Fast R-CNN  D        -          AlexNet  28.8    79.1    30.3    1.5     42.6    42.7    17.2    13.4    31.6    23.7    29.9    40.2    36.2    40.5    23.4    59.9    26.4    24.9    58.3    34.2
Ours (supervision transfer)     D        -          AlexNet  31.2    80.7    38.6    2.5     52.2    52.2    17.2    18.2    50.8    35.1    37.4    51.3    50.5    43.4    41.0    63.5    29.3    37.4    59.8    41.7

Table 9: Object Detection AP(%) on NYUD2 test set: We compare our performance against several state-of-the-art methods. RGB Arch.
and D Arch. refer to the CNN architecture used by the detector. When using just the depth image, our method improves
performance from 34.2% to 41.7%. When used in addition to features from the RGB image, our learned features improve performance
from 44.4% to 47.1% (when using AlexNet RGB features) and from 46.2% to 49.1% (when using VGG RGB features) over past methods
for learning features from depth images. Analogous to summary Table 6 in the main paper.


method                             Supervision   modality brush    catch    clap     climb    golf     jump     kick     pick     pour     pullup   push     run      shoot    shoot    shoot    sit      stand    swing    throw    walk     wave     mean
                                                          hair                       stairs                     ball                                                  ball     bow      gun                        baseball
Gkioxari et al. [14]               -             RGB      55.8     25.5     25.1     24.0     77.5     1.9      5.3      21.4     68.6     71.0     15.4     6.3      4.6      41.1     28.0     9.4      8.2      19.9     17.8     29.2     11.5     27.0
Gkioxari et al. [14] + Fast R-CNN  -             RGB      47.2     35.2     30.1     23.9     84.4     2.2      10.6     20.7     79.7     78.7     25.2     14.4     8.7      45.3     34.2     11.7     13.3     39.0     19.1     23.9     23.9     32.0
Gkioxari et al. [14]               Sup. PreTr.   flow     32.3     5.0      35.6     30.1     58.0     7.8      2.6      16.4     55.0     72.3     8.5      6.1      3.9      47.8     7.3      24.9     26.3     36.3     4.5      22.1     7.6      24.3
Gkioxari et al. [14] + Fast R-CNN  Sup. PreTr.   flow     54.9     17.0     52.5     56.5     81.2     15.0     10.9     28.9     72.7     86.6     20.4     17.5     10.2     61.9     25.5     31.4     42.4     53.8     10.9     38.6     17.3     38.4
Random Init                        No PreTr.     flow     44.3     11.0     42.8     38.7     76.1     10.6     6.6      23.1     62.1     84.0     15.4     9.6      6.8      60.0     22.8     29.6     26.8     43.5     10.7     30.8     9.8      31.7
Ours (supervision transfer)        Sup. Transfer flow     54.6     17.7     45.1     54.9     80.3     14.6     9.7      28.2     69.3     84.8     19.9     15.6     7.2      49.6     29.4     29.5     28.4     49.5     11.6     36.3     13.0     35.7

Table 10: Action Detection AP(%) on the JHMDB test set: We report action detection performance on the test set of JHMDB using
RGB or flow images. The bottom part of the table compares our supervision transfer method against the baseline of random
initialization and against the ceiling given by the fully supervised pre-training method from [14]. Our method reaches more
than half the way towards fully supervised pre-training. Analogous to Table 7 in the main paper.